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Abstract 

The widespread and rapid adoption of high-throughput sequencing technologies has afforded re- 
searchers the opportunity to gain a deep understanding of genome level processes that underlie 
evolutionary change, and perhaps more importantly, the links between genotype and phenotype. In 
particular, researchers interested in functional biology and adaptation have used these technologies 
to sequence mRNA transcriptomes of specific tissues, which in turn are often compared to other 
tissues, or other individuals with different phenotypes. While these techniques are extremely power- 
ful, careful attention to data quality is required. In particular, because high-throughput sequencing 
is more error-prone than traditional Sanger sequencing, quality trimming of sequence reads should 
be an important step in all data processing pipelines. While several software packages for qual- 
ity trimming exist, no general guidelines for the specifics of trimming have been developed. Here, 
using empirically derived sequence data, I provide general recommendations regarding the optimal 
strength of trimming, specifically in mRNA-Seq studies. Although very aggressive quality trimming 
is common, this study suggests that a more gentle trimming, specifically of those nucleotides whose 
Phred score <2 or <5, is optimal for most studies across a wide variety of metrics. 



Introduction 

The popularity of genome-enabled biology has increased dramatically over the last few years. While 
researchers involved in the study of model organisms have had the ability to leverage the power of 
genomics for nearly a decade, this power is only now available for the study of non-model organisms. 
For many, the primary goal of these newer works is to better understand the genomic underpinnings of 
adaptive (Linnen et al., 2013; Narum et al., 2013) or functional (Hsu et al., 2012; Munoz-Merida 
et al., 2013) traits. While extremely promising, the study of functional genomics in non-model 
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organisms typically requires the generation of a reference transcriptome to which comparisons are 
made. Although compared to genome assembly transcriptome assembly is less challenging (Bradnam 
et al . , 2013; Earl et al., 2011), significant computational hurdles still exist. Amongst the most difficult 
of challenges in transcriptome assembly involves the reconstruction of isoforms (Pyrkosz et al., 2013), 
simultaneous assembly of transcripts where read coverage (=expression) varies by orders of magnitude, 
and overcoming biases related to random hexamer (Hansen et al., 2010) and GC content (Dohm 
et al., 2008). 

These processes are further complicated by the error-prone nature of high-throughput sequencing 
reads. With regards to lllumina sequencing, error is distributed non-randomly over the length of the 
read, with the rate of error increasing from 5' to 3' end (Liu et al., 2012). These errors are 
overwhelmingly substitution errors (Yang et al., 2013), with the global error rate being between 1% 
and 3%. Although de Bruijn graph assemblers do a remarkable job in distinguishing error from correct 
sequence, sequence error does results in assembly error (MacManes and Eisen, 2013). While this type 
of error is problematic for all studies, it may be particularly troublesome for SNP-based population 
genetic studies. In addition to the biological concerns, sequencing read error may results in problems 
of a more technical importance. Because most transcriptome assemblers use a de Bruijn graph 
representation of sequence connectedness, sequencing error can dramatically increase the size and 
complexity of the graph, and thus increase both RAM requirements and runtime. 

In addition to sequence error correction, which has been shown to improve accuracy of the de novo 
assembly (MacManes and Eisen, 2013), low quality (=high probability of error) nucleotides are 
commonly removed from the sequencing reads prior to assembly, using one of several available tools 
(Trimmomatic (Lohse et al., 2012), Fastx Toolkit 

(http : //hannonlab . cshl . edu/f astx_toolkit/index . html), or BIOPIECES 
(http://www.biopieces.org/)). These tools typically use either a sliding window approach, 
discarding nucleotides falling below a given (user selected) average quality threshold, or trimming of 
low-quality nucleotides at one or both ends of the sequencing read. Though the absolute number will 
surely be decreased in the trimmed dataset, aggressive quality trimming may remove a substantial 
portion of the total read dataset, which in transcriptome studies may disproportionately effect lower 
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expression transcripts. 

Although the process of nucleotide quality trimming is commonplace, particularly in the 
assembly-based HTS analysis pipelines (e.g. SNP development (Helyar et a I . , 2012; Milano et al., 
2011), functional studies (Ansell et al., 2013; Bhardwaj et al., 2013), and more general studies of 
transcriptome characterization (Liu et al., 2013; MacManes and Lacey, 2012)), its optimal 
implementation has not been well defined. Though the rigor with which trimming is performed may be 
guided by the design of the experiment, a deeper understanding of the effects of trimming is desirable. 
As transcriptome-based studies of functional genomics continue to become more popular, 
understanding how quality trimming of mRNA-seq reads used in these types of experiments is urgently 
needed. Researchers currently working in these field appear to favor aggressive trimming (e.g. (Looso 
et al., 2013; Riesgo et al., 2012)), but this may not be optimal. Indeed, one can easily image 
aggressive trimming resulting in the removal of a large amount of high quality data (even nucleotides 
removed with the commonly used Phred=20 threshold are accurate 99% of the time), just as 
lackadaisical trimming (or no trimming) may result in nucleotide errors being incorporated into the 
assembled transcriptome. 

Here, I provide recommendations regarding the efficient trimming of high-throughput sequence reads, 
specifically for mRNASeq reads from the lllumina platform. To do this, I used publicly available 
datasets containing lllumina reads derived from Mus musculus. Subsets of these data (10 million, 20 
million, 50 million, 75 million, 100 million reads) were randomly chosen, trimmed to various levels of 
stringency, assembled then analyzed for assembly error and content. In addition to this, I develop a set 
of metrics that may be generally useful in evaluating the quality of transcriptome assemblies. These 
results aim to guide researchers through this critical aspect of the analysis of high-throughput 
sequence data. While the results of this paper may not be applicable to all studies, that so many 
researchers are interested in the genomics of adaptation and phenotypic diversity, particularly in 
non-model organisms suggests its widespread utility. 
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Materials and Methods 

Because I was interested in understanding the effects of sequence read quality trimming on the quality 
of vertebrate transcriptome assembly, I elected to analyze a publicly available (SRR797058) paired-end 
lllumina read dataset. This dataset is fully described in a previous publication (Han et al., 2013), and 
contains 232 million paired-end lOOnt lllumina reads. To investigate how sequencing depth influences 
the choice of trimming level, reads data were randomly subsetted into 10 million, 20 million, 50 
million, 75 million, 100 million read datasets. To test the robustness of my findings, I evaluated a 
second dataset (SRR385624, Macfarlan et al. (2012)) as well as a technical replicate of the primary 
dataset, both at the 10M read dataset size. 

Read datasets were trimmed at varying quality thresholds using the software package Trimmomatic 
version 0.30 (Lohse et al., 2012), which was selected as it appears to be amongst the most popular of 
read trimming tools. Specifically, sequences were trimmed at both 5' and 3' ends using Phred =0 
(adapter trimming only), < 2, < 5, < 10, and < 20. Other parameters (MINLEN=25, 
ILLUMINACLIP=barcodes.fa:2:40:15, SLIDINGWINDOW size=4) were held constant. Transcriptome 
assemblies were generated for each dataset using the default settings (except group_pairs_distance flag 
set to 999) of the program Trinity R2013-02-25 (Grabherr et al., 2011; Haas et al., 2013). 
Assemblies were evaluated using a variety of different metrics, many of them comparing assemblies to 
the complete collection of Mus cDNA's, available at 
http : //useast . ensembl . org/ info/data/ ftp/ index . html. 

Quality trimming may have substantial effect on assembly quality, and as such, I sought to identify 
high quality transcriptome assemblies. Assemblies with few nucleotide errors relative to a known 
reference may indicate high quality. The program Blat v34 (Kent, 2002) was used to identify and 
count nucleotide mismatches between reconstructed transcripts and their corresponding reference. To 
eliminate spurious short matches between query and template inflating estimates of error, only unique 
transcripts that covered more than 90% of their reference sequence were used. Next, because kmers 
represent the fundamental unit of assembly, kmers (k=25) were counted for each dataset using the 
program Jellyfish vl.1.11 (Marcais and Kingsford, 2011). Another potential assessment of assembly 
quality may be related to the number of paired-end sequencing reads that concordantly map to the 
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89 assembly. As the number of reads concordantly mapping increased, so does assembly quality. To 

90 characterize this, I mapped the full dataset (not subsampled) of adapter trimmed sequencing reads to 

91 each assembly using Bowtie2 v2.1.0 (Trapnell et al., 2010) using default settings, except for maximum 

92 insert size (-X 999) and number of multiple mappings (-k 30). 

93 Aside from these metrics, measures of assembly content were also assayed. Here, open reading frames 

94 (ORFs) were identified using the default settings of the program TransDecoder R20131110 

95 (http://transdecoder.sourceforge.net/), and were subsequently translated into amino acid 

96 sequences, both using default settings. The larger the number of complete open reading frames 

97 (containing both start and stop codons) the better the assembly. Next, unique transcripts were 

98 identified using the blastP program within the Blast+ package version 2.2.28 (Camacho et al., 

99 2009). Blastp hits were retained only if the sequence similarity was >80% over at least 100 amino 

100 acids, and e-value <10~ 10 . As the number of transcripts matching a given reference increases, so may 

101 assembly quality. Lastly, because the effects of trimming may vary with expression, I estimated 

102 expression (e.g. FPKM) for each assembled contig using default settings of the the program eXpress 

103 vl.5.0 (Roberts and Pachter, 2013) and the BAM file produced by Bowtie2 as described above. Code 
w for performing the subsetting, trimming, assembly, peptide and ORF prediction and blast analyses can 

105 be found in the following Github folder 

106 https : //github . com/macmanes/trimming_paper/tree/recreate_ms_analyses/scripts. 

107 Results 

los Quality trimming of sequence reads had a relatively large effect on the total number of errors 

109 contained in the final assembly (Figure 1), which was reduced by between 9 and 26% when comparing 

no the assemblies of untrimmed versus Phred=20 trimmed sequence reads. Most of the improvement in 

in accuracy is gained when trimming at the level of Phred=5 or greater, with modest improvements 

ii2 potentially garnered with more aggressive trimming at certain coverage levels (Table 1). 



ii3 Figure 1 
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lu Figure 1. The number of nucleotide errors contained in the final transcriptome assembly, 

115 normalized to assembly size, is related to the strength of quality trimming. This patterns is largely 

116 unchanged with varying depth of sequencing coverage (10 million to 100 million sequencing reads). 

117 Trimming at Phred = 5 may be optimal, given the potential untoward effects of more stringent 
us quality trimming. 10M, 20M, 50M, 75M, 100M refer to the subsamples size. 10M replicate is the 

119 technical replicate, 10M alt. dataset is the secondary dataset. Note that to enhance clarity, the 

120 Y-axis does not start at zero. 



121 In de Bruijn graph-based assemblers, the kmer is the fundamental unit of assembly. Even in 

122 transcriptome datasets, unique kmers are likely to be formed as a results of sequencing error, and 

123 therefore may be removed during the trimming process. Figure 2A shows the pattern of unique kmer 

124 loss across the various trimming levels and read datasets. What is apparent, is that trimming at 

125 Phred=5 removes a large fraction of unique kmers, with either less- or more-aggressive trimming 

126 resulting in smaller effects. In contrast to the removal of unique kmers, those kmers whose frequency 

127 is >1 are more likely to be real, and therefore should be retained. Figure 2B shows that while 

128 Phred=5 removes unique kmers, it may also reduce the number of non-unique kmers, which may 



Downloaded from http://biorxiv.org/on September 18, 2014 



129 hamper the assembly process. 

130 Figure 2 




o J 



No Trim P=2 P=5 P=10 P=20 No Trim P=2 P=5 P=10 P=20 

131 Figure 2A. The number of unique kmers removed with various trimming levels across all datasets. 

132 Trimming at Phred=5 results in a substantial loss of likely erroneous kmers, while the effect of 

133 more and less aggressive trimming is more diminished. 2B depicts the relationship between 

134 trimming and non-unique kmers, whose pattern is similar to that of unique kmers. 



135 In addition to looking at nucleotide error and kmer distributions, assembly quality may be measured by 

136 the the proportion of sequencing reads that map concordantly to a given transcriptome assembly 

137 (Hunt et al., 2013). As such, the analysis of assembly quality includes study of the mapping rates. 

138 Here, I found small but important effects of trimming. Specifically, assembling with aggressively 

139 quality trimmed reads decreased the proportion of reads that map concordantly. For instance, the 

wo percent of reads successfully mapped to the assembly of 10 million Q20 trimmed reads was decreased 

141 by 0.6% or approximately 1.4 million reads (compared to mapping of untrimmed reads) while the 

142 effects on the assembly of 100 million Q20 trimmed reads was more blunted, with only 381,000 fewer 
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143 reads mapping. Though the differences in mapping rates are exceptionally small, when working with 

144 extremely large datasets, the absolute difference in reads utilization may be substantial. 

145 Analysis of assembly content painted a similar picture, with trimming having a relatively small, though 

146 tangible effect. The number of BLAST+ matches decreased with stringent trimming (Figure 3), 

147 with trimming at Phred=20 associated with particularly poor performance. The maximum number 
we of BLAST hits for each dataset were 10M=27452 hits, 20M=29563 hits, 50M=31848 hits, 

149 75M=32786 hits, and 100M=33338 hits. 



Figure 3 
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151 Figure 3. The number of unique Blast matches contained in the final transcriptome assembly is 

152 related to the strength of quality trimming, with more aggressive trimming resulting in worse 

153 performance. Data are normalized to the number of BLAST hits obtained in the most favorable 

154 trimming level for each dataset. Negative numbers indicate the detrimental affect of trimming. 

155 10M, 20M, 50M, 75M, 100M refer to the subsamples size. 10M replicate is the technical replicate, 

156 10M alt. dataset is the secondary dataset. 
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157 When counting complete open reading frames recovered in the different assemblies, all datasets were 

158 all worsened by aggressive trimming, as evidenced by negative values in Figure 4. Trimming at 

159 Phred=20 was the most poorly performing level at all read depths. The maximum number of 
wo complete open reading frames for each dataset were 10M=11429 ORFs, 20M=19463 ORFs, 

lei 50M=35632 ORFs, 75M=42205 ORFs, 100M=48434 ORFs. 

162 Figure 4 




163 Figure 4. The number of complete exons contained in the final transcriptome assembly is related to 

164 the strength of quality trimming for any of the studied sequencing depths, Trimming at 

165 Phred=20 was always associated with poor performance. Data are normalized to the number of 

166 complete exons obtained in the most favorable trimming level for each dataset. Negative numbers 
16? indicate the detrimental affect of trimming. 10M, 20M, 50M, 75M, 100M refer to the subsamples 
168 size. 10M replicate is the technical replicate, 10M alt. dataset is the secondary dataset. 



169 Of note, all assembly files will be deposited in Dryad upon acceptance for publication. Until then, they 
no can be accessed via https://www.dropbox.com/sh/oiem0v5jgr5c5ir/TYQdGcpYwP. 
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i7i Discussion 

m Although the process of nucleotide quality trimming is commonplace in HTS analysis pipelines, 

173 particularly those involving assembly, its optimal implementation has not been well defined. Though 

174 the rigor with which trimming is performed seems to vary, there is a bias towards stringent trimming 

17 5 (Ansell et al., 2013; Barrett and Davis, 2012; Straub et al., 2013; Tao et al., 2013). This study 

176 provides strong evidence that stringent quality trimming of nucleotides whose quality scores are < 20 

177 results in a poorer transcriptome assembly across the majority metrics. Instead, researchers interested 

178 in assembling transcriptomes de novo should elect for a much more gentle quality trimming, or no 

179 trimming at all. Table 1 summarizes my finding across all experiments, where the numbers represent 

180 the trimming level that resulted in the most favorable result. What is apparent, is that for 

181 typically-sized datasets, trimming at Phred=2 or Phred=5 optimizes assembly quality. The 

182 exception to this rule appears to be in studies where the identification of SNP markers from high (or 

183 very low) coverage datasets is the primary goal. 

184 Table 1 



Dataset Size 


Error 


Map 


Orf 


Blast 


10M 


20 


0 


2 


2 


10M REP. 


20 


2 


2 


2 


10M ALT 


20 


2 


0 


0 


20M 


5 


5 


2 


2 


50M 


5 


10 


5 


2 


75M 


20 


10 


5 


0 


100M 


20 


0 


2 


2 



186 Table 1. The Phred trimming levels that resulted in optimal assemblies across the 4 metrics 

187 tested in the different size datasets. Error= the number of nucleotide errors in the assembly. 

188 Map= the number of concordantly mapped reads. ORF= the number of ORFs identified. 

189 BLAST= the number of unique BLAST hits. 10M rep. is the technical replicate, 10M alt. is the 

190 secondary dataset. 
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191 The results of this study were surprising. In fact, much of my own work assembling transcriptomes 

192 included a vigorous trimming step. That trimming had generally small effects, and even negative 

193 effects when trimming at Phred=20 was unexpected. To understand if trimming changes the 

194 distribution of quality scores along the read, we generated plots with the program SolexaQA (Cox 

195 et al . , 2010). Indeed, the program modifies the distribution of Phred scores in the predicted fashion 

196 yet downstream effects are minimal. This should be interpreted as speaking to the performance of the 

197 the bubble popping algorithms included in Trinity and other de Bruijn graph assemblers. 

198 The majority of the results presented here stem from the analysis of a single lllumina dataset and 

199 specific properties of that dataset may have biased the results. Though the dataset was selected for its 

200 'typical' lllumina error profile, other datasets may produce different results. To evaluate this possibility, 

201 a second dataset was evaluated at the 10M subsampling level. Interestingly, although the assemblies 

202 based on this dataset contained more error (e.g. Figure 1), aggressive trimming did not improve quality 

203 for any of the assessed metrics, though like other datasets, the absolute number of errors were reduced. 

204 In addition to the specific dataset, the subsampling procedure may have resulted in undetected biases. 

205 To address these concerns, a technical replicate of the original dataset was produced at the 10M 

206 subsampling level. This level was selected as a smaller sample of the total dataset is more likely to 

207 contain an unrepresentative sample than larger samples. The results, depicted in all figures as the solid 

208 purple line, are concordant. Therefore, I believe that sampling bias is unlikely to drive the patterns 

209 reported on here. 

210 What is missing in trimmed datasets? — The question of differences in recovery of specific 

211 contigs is a difficult question to answer. Indeed, these relationships are complex, and could involve a 

212 stochastic process, or be related to differences in expression (low expression transcripts lost in trimmed 

213 datasets) or length (longer contigs lost in trimmed datasets). To investigate this, I attempted to 

214 understand how contigs recovered in the 10 million read untrimmed dataset, but not in the 

215 Phred=20 trimmed dataset were different. Using the information on FPKM and length generated by 

216 the program eXpress, it was clear that the transcripts unique to the untrimmed dataset were more 

217 lowly expressed (mean FPKM=3.2) when compared to the entire untrimmed dataset (mean 

218 FPKM=11.1; W = 18591566, p-value = 7.184e-13, non-parametric Wilcoxon test). 
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219 I believe that the untoward effects of trimming are linked to a reduction in coverage. For the datasets 

220 tested here, trimming at Phred=20 resulted in the loss of nearly 25% of the dataset, regardless of 

221 the size of the initial dataset. This relationship does suggest, however, that the magnitude of the 

222 negative effects of trimming should be reduced in larger datasets, and in fact may be completely erased 

223 with ultra-deep sequencing. Indeed, when looking at the differences in the magnitude of negative 

224 effects in the datasets presented here, it is apparent that trimming at Phred=20 is 'less bad' in the 

225 100M read dataset than it is in the 10M read datasets. For instance, Figure 2B demonstrates that one 

226 of the untoward effects of trimming, the reduction of non-unique kmers, is reduced as the depth of 

227 sequencing is increased. Figures 3 and 4 demonstrate a similar pattern, where the negative effects of 

228 aggressive trimming of higher coverage datasets are blunted relative to lower coverage datasets. 

229 Turning my attention to length, when comparing uniquely recovered transcripts to the entire 

230 untrimmed dataset of 10 million reads, it appears to be the shorter contigs (mean length 857nt versus 

231 954nt; W = 26790212, p-value <2.2e-16) that are differentially recovered in the untrimmed dataset 

232 relative to the Phred=20 trimmed dataset. 

233 Effects of coverage on transcriptome assembly — Though the experiment was not 

234 designed to evaluate the effects of sequencing depth on assembly, the data speak well to this issue. 

235 Contrary to other studies, suggesting that 30 million paired end reads were sufficient to cover 

236 eukaryote transcriptomes (Francis et al., 2013), the results of the current study suggest that assembly 

237 content was more complete as sequencing depth increased; a pattern that holds at all trimming levels. 

238 Though the suggested 30 million read depth was not included in this study, all metrics, including the 

239 number of assembly errors, as well as the number of exons, and BLAST hits were improved as read 

240 depth increased. While generating more sequence data is expensive, given the assembled 

241 transcriptome reference often forms the core of future studies, this investment may be warranted. 

242 Should quality trimming be replaced by unique kmer filtering? — For transcriptome 

243 studies that revolve around assembly, quality control of sequence data has been thought to be a 

244 crucial step. Though the removal of erroneous nucleotides is the goal, how best to accomplish this is 

245 less clear. As described above, quality trimming has been a common method, but in its commonplace 

246 usage, may be detrimental to assembly. What if, instead of relying on quality scores, we instead rely 
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247 on the distribution of kmers to guide our quality control endeavors? In transcriptomes of typical 

248 complexity, sequenced to even moderate coverage, it is reasonable to expect that all but the most 

249 exceptionally rare mRNA molecules are sequenced at a depth >1. Following this, all kmer whose 

250 frequency is <2 are putative errors, and should be removed before assembly, though this process may 

251 result in the loss of kmers from extremely low abundance transcripts or isoforms. This idea and its 

252 implementation are fodder for future research. 

253 In summary, the process of nucleotide quality trimming is commonplace in many HTS analysis 

254 pipelines, but its optimal implementation has not been well defined. A very aggressive strategy, where 

255 sequence reads are trimmed when Phred scores fall below 20 is common. My analyses suggest that 

256 for studies whose primary goal is transcript discovery, that a more gentle trimming strategy (e.g. 

257 Phred=2 or Phred=5) that removes only the lowest quality bases is optimal. In particular, it 

258 appears as if the shorter and more lowly expressed transcripts are particularly vulnerable to loss in 

259 studies involving more harsh trimming. The one potential exception to this general recommendation 

260 may be in studies of population genomics, where deep sequencing is leveraged to identify SNPs. Here, 

261 a more stringent trimming strategy may be warranted. 
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